Experiments on Document Chunking and Query Formation for Plagiarism Source Retrieval

نویسندگان

  • Amit Prakash
  • Sujan Kumar Saha
چکیده

This paper presents the details of the system we prepare as a participant of the PAN 2014 task on 'Source Retrieval: Uncovering Plagiarism, Authorship, and Social Software Misuse'. Our work is focused on intelligent chunking of suspicious documents and a hybrid approach of query formation. A method based on term frequency and word co-occurrence is proposed to extract query terms from a non-overlapping chunk of topically related sentences. The queries are then submitted to the ChatNoir search API to retrieve documents that are likely to be the sources of plagiarism. Finally a snippet matching and duplicate download restriction based filtering technique reduces the number of downloads. The evaluation results of the PAN14 Source Retrieval task show that the performance of our system is highly promising. The f-measure accuracy of the system is .3871 with a recall of .5083 which is the highest among all the participants.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword Spotting is a well-known method in document image retrieval. In this method, Search in document images is based on query word image. In this Paper, an approach for document image retrieval based on keyword spotting has been proposed. In proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method is used in this paper to imp...

متن کامل

Efficient Paragraph based Chunking and Download Filtering for Plagiarism Source Retrieval

This paper describes the approach of the system that we built as part of the participation in ‘PAN 2015 Source Retrieval’ task. Chunking of documents based on paragraphs and efficient download filtering improved the overall performance of the system. Source Retrieval is an important task of a Plagiarism Detection system

متن کامل

Source Retrieval and Text Alignment Corpus Construction for Plagiarism Detection

For the task of source retrieval, we focus on the process of Download Filtering. For the process from chunking to search control, we aim at high recall, and for the process of download filtering, we devote to improve precision. A vote-based approach and a classification-based approach are incorporated to filter the searching results to get the plagiarism sources. For the task of text alignment ...

متن کامل

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

متن کامل

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014